The Journal of Toxicological Sciences
Online ISSN : 1880-3989
Print ISSN : 0388-1350
ISSN-L : 0388-1350
Original Article
Developing a GNN-based AI model to predict mitochondrial toxicity using the bagging method
Yoshinobu Igarashi, Ryosuke Kojima, Shigeyuki Matsumoto, Hiroaki Iwata, Yasushi Okuno, Hiroshi Yamada

2024 Volume 49 Issue 3 Pages 117-126

Abstract

Mitochondrial toxicity has been implicated in the development of various toxicities, including hepatotoxicity. Therefore, mitochondrial toxicity has become a major screening factor in the early discovery phase of drug development. Several models have been developed to predict mitochondrial toxicity based on chemical structures. However, they only provide a binary classification of positive or negative results and do not identify the substructures that contribute to a positive decision. Therefore, we developed an artificial intelligence (AI) model to predict mitochondrial toxicity and visualize structural alerts. To construct the model, we used the open-source software library kMoL, which employs a graph neural network approach that allows learning from chemical structure data. We also utilized the integrated gradient method, which enables the visualization of substructures that contribute to positive results. The dataset used to construct the AI model was highly imbalanced, with far more negative than positive data. To address this, we employed the bagging method, which resulted in a model with high predictive performance, as evidenced by an F1 score of 0.839. This model can also be used to visualize substructures that contribute to mitochondrial toxicity using the integrated gradient method. Our AI model predicts mitochondrial toxicity based on chemical structures and may contribute to screening mitochondrial toxicity in the early stages of drug discovery.

INTRODUCTION

Mitochondria are organelles that provide cells with energy produced via oxidative phosphorylation and play an important role in maintaining life. Drugs that adversely affect mitochondrial function can cause toxicity in various organs. For example, cerivastatin was withdrawn from the market because it causes rhabdomyolysis, and troglitazone and tolcapone were withdrawn because they cause liver damage. All of these drugs have been reported to induce mitochondrial toxicity (Sakamoto and Kimura, 2013; Jaeschke, 2007; Grünig et al., 2017). Therefore, in vitro mitochondrial toxicity assays are significant screening tools in the early stages of drug development (Dykens and Will, 2007; Will and Dykens, 2014). In addition to in vitro evaluation methods, in silico evaluation methods are important for efficient drug development. Although several models (Zhang et al., 2009; Hemmerich et al., 2020; Tang et al., 2020; Zhao et al., 2021; Bringezu et al., 2021; Jaganathan et al., 2022) have been developed to predict mitochondrial toxicity based on chemical structure, they cannot provide structural alerts, which contribute to structure-activity relationship (SAR) studies. Therefore, we developed an artificial intelligence (AI)-based mitochondrial toxicity prediction model that predicts toxicity from chemical structures and also possesses the capacity to visualize substructures contributing to toxicity. We employed graph neural networks (GNNs) including graph convolutional networks (GCNs) as deep learning techniques.

In recent years, deep learning-based AI techniques have been applied to toxicology research. A GNN is one such method; it learns the latent vectors of nodes in graphs (Kipf and Welling, 2016; Wu et al., 2019; Morris et al., 2019; Li et al., 2020; Brody et al., 2021). Traditional models that predict toxicity based on chemical structures have been constructed using molecular descriptors, which are feature values calculated from a Simplified Molecular Input Line Entry System (SMILES) string, a sequence of alphanumeric characters representing a chemical structure. Thus, even for the same chemical structure, the training data differ depending on the method used to derive the molecular descriptors. A GNN can learn directly from the chemical structure itself without converting it into other forms, such as descriptors, which eliminates biases introduced by the choice of molecular descriptor calculation method. Additionally, the integrated gradient (IG) method (Sundararajan et al., 2017) can be combined with a GNN to visualize the basis for predictions, which has been a challenge in developing predictive models using AI technology. The software package kMoL (https://github.com/elix-tech/kmol), in which the GNN and IG functions are implemented, was used to develop an AI model for predicting mitochondrial toxicity. The kMoL software is based on kGCN, which was constructed in a previous study (Kojima et al., 2020), and was developed by Kojima and Okuno of Kyoto University and Elix Corporation.

A major issue in developing toxicity prediction models is the imbalance between the amounts of positive and negative data used for training. In many cases, particularly for publicly available toxicity data for pharmaceuticals, negative data far outnumber positive data. This imbalance in training data can lead to poor performance of toxicity prediction models. In our study, AI models constructed without addressing the imbalanced dataset did not exhibit adequate predictive performance. Therefore, we used the bagging method (Breiman, 1996) with under-sampling from the majority class (Wang et al., 2009; Wallace et al., 2011) to develop AI models with a high predictive performance. The bagging method is an ensemble technique that trains and aggregates multiple predictive models. Each model is trained on a randomly resampled subset of the original dataset, and the final prediction is determined using the average or majority vote of the model predictions. This allows imbalanced datasets to be trained as balanced datasets. Additionally, it reduces overfitting and enables a more stable and reliable prediction. This study provides an example of the use of imbalanced datasets in developing AI models.

Hepatotoxicity is a typical organ toxicity that is attributed to mitochondrial toxicity (Mihajlovic and Vinken, 2022). The mechanisms underlying hepatotoxicity are complex and diverse. To conduct a SAR study of hepatotoxic compounds, selecting a screening method tailored to the toxicity expression mechanism is essential. In this study, we discuss applications of our AI model to elucidate the mechanism underlying hepatotoxicity induction.

We developed a GNN-based AI model to predict mitochondrial toxicity using the bagging method. Our AI model not only predicted toxicity but was also able to visualize the substructures possibly involved in mitochondrial toxicity and their contribution using the IG method. Our AI model was developed as part of a collaborative industry-academic research project. This initiative focused on creating advanced AI systems to support early-stage drug discovery research by leveraging recent advances in AI technology. The developed AI model will be integrated into the project’s AI system for practical applications. In this paper, we present the results of our study and discuss the advantages of our AI model.

MATERIALS AND METHODS

Data preparation

Data sources and number of compounds used to construct the mitochondrial toxicity prediction AI model are listed in Table 1 (Hallinger et al., 2020; Rana et al., 2019; Zhang et al., 2009). If SMILES were unavailable from the data source, they were identified using PubChem. Compounds that induced mitochondrial toxicity in in vitro studies (Table 1) were labeled as positive, while those that did not induce toxicity were labeled as negative. If a molecule had multiple fragments, the “parent” molecule was used; SMILES with multiple labels were excluded because each SMILES must be uniquely labeled. SMILES were standardized using a standardizer in the ChEMBL structure pipeline (Bento et al., 2020). Finally, 896 positive and 5,170 negative compound data points were collected.
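The rule that each SMILES must carry a single label can be illustrated with a minimal sketch. The function name `unique_labelled` and the toy records are hypothetical, and the SMILES strings are assumed to have been standardized already; the ChEMBL structure pipeline step itself is not reproduced here.

```python
from collections import defaultdict

def unique_labelled(records):
    """Keep SMILES that carry exactly one label; drop conflicting ones.

    records: iterable of (smiles, label) pairs, with 1 = positive and
    0 = negative. Assumes the SMILES strings are already standardized.
    """
    labels = defaultdict(set)
    for smiles, label in records:
        labels[smiles].add(label)
    # A SMILES reported as both positive and negative is excluded entirely.
    return {s: next(iter(ls)) for s, ls in labels.items() if len(ls) == 1}

# Toy records (illustrative only, not taken from the study dataset).
records = [
    ("CCO", 0),
    ("CCO", 0),          # duplicate with the same label: merged
    ("c1ccccc1O", 1),
    ("CC(=O)O", 1),
    ("CC(=O)O", 0),      # conflicting labels: excluded
]
dataset = unique_labelled(records)
print(dataset)
```

Merging identical entries before checking for conflicts means that a compound reported by several sources with the same result is counted once, which is consistent with the "total without duplicates" row of Table 1.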

Table 1. Data source and number of compounds for the mitochondrial toxicity model.

| Author | Data source | Number of positive and negative compounds (positive:negative) | In vitro mitochondrial toxicity assay methods |
| --- | --- | --- | --- |
| U.S. Tox 21 program | https://pubchem.ncbi.nlm.nih.gov/bioassay/720637 | 638:5027 | Quantitative high-throughput screening (qHTS) assay for small molecule disruptors of the mitochondrial membrane potential |
| Hallinger R.D., et al. | https://pubmed.ncbi.nlm.nih.gov/32374859 | 241:787 | Respirometric screening assay |
| Rana P., et al. | https://pubmed.ncbi.nlm.nih.gov/30525499 | 54:174 | Respirometric screening assay; assessment of mitochondrial toxicity using glucose/galactose model |
| Zhang H., et al.* | https://pubmed.ncbi.nlm.nih.gov/18940245 | 45:(0) | Mitochondrial toxic compound dataset constructed by Zhang et al. based on various literature information (e.g., measurements of mitochondrial membrane potential, mitochondrial oxygen consumption, etc.) |
| Total without duplicates | | 896:5170 | |

*Only positive data were employed.

Visualization of chemical space

Pharmaceutical data were collected from https://open.fda.gov/apis/drug/ndc/download/. From 56,651 entries in ‘HUMAN OTC DRUG’ and 52,198 entries in ‘HUMAN PRESCRIPTION DRUG,’ the ‘active_ingredients’ field was extracted. After removing duplicates, 2,826 unique active ingredient records were identified. Drug SMILES were identified using PubChem and standardized using the standardizer in the ChEMBL structure pipeline. Chemical spaces were visualized using ChemPlot (Cihan Sorkun et al., 2022) with Uniform Manifold Approximation and Projection (UMAP) (McInnes et al., 2020). UMAP is a method of projecting high-dimensional data to a lower dimension, allowing visualization of relationships between the data.
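The extraction and de-duplication of active ingredients can be sketched as follows. This is an illustration under assumptions rather than the authors' script: the records are assumed to follow the openFDA NDC JSON layout, with a `product_type` string and an `active_ingredients` list whose items carry a `name`; the function name, the case normalization, and the toy records are hypothetical. Whether the authors de-duplicated individual ingredients or whole records is not stated.

```python
def unique_active_ingredients(results, product_types):
    """Collect distinct active-ingredient names from NDC-style records."""
    names = set()
    for record in results:
        if record.get("product_type") not in product_types:
            continue
        for ingredient in record.get("active_ingredients", []):
            # Normalize case and whitespace so trivial variants collapse
            # (an assumption; the paper does not describe normalization).
            names.add(ingredient["name"].strip().upper())
    return sorted(names)

# Toy records (illustrative only, not real NDC entries).
records = [
    {"product_type": "HUMAN OTC DRUG",
     "active_ingredients": [{"name": "IBUPROFEN", "strength": "200 mg/1"}]},
    {"product_type": "HUMAN PRESCRIPTION DRUG",
     "active_ingredients": [{"name": "Ibuprofen ", "strength": "800 mg/1"},
                            {"name": "FAMOTIDINE", "strength": "26.6 mg/1"}]},
    {"product_type": "HUMAN OTC DRUG"},          # record without the field
    {"product_type": "OTHER", "active_ingredients": [{"name": "PLACEHOLDER"}]},
]
wanted = {"HUMAN OTC DRUG", "HUMAN PRESCRIPTION DRUG"}
ingredients = unique_active_ingredients(records, wanted)
print(ingredients)
```

The resulting names would then be resolved to SMILES (via PubChem in the study) before standardization and projection.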

Computational resources for model construction

Model construction was carried out on a desktop computer equipped with an AMD EPYC 7542 processor (2.90 GHz) and 128 GB RAM. For model training, we utilized NVIDIA RTX A6000 and Quadro RTX 8000 GPUs with CUDA 11.7 and PyTorch 1.13.1. Training, including the hyperparameter search, took an average of 55 hr per sub-model, and the total process consumed approximately 500 hr.

Deep learning software package

The AI models were constructed using kMoL version 1.1.5 (https://github.com/elix-tech/kmol). Chemical spaces were visualized using ChemPlot 1.2.0 (https://chemplot.readthedocs.io/en/latest/).

AI model construction

The AI model was constructed using the GNN approach in two ways: one without splitting the dataset and the other by splitting the dataset into subsets (the bagging method).

For the method without splitting the dataset, the training, validation, and test sets were partitioned at a ratio of 8:1:1 using random sampling. Hyperparameter tuning was conducted using the training and validation sets with 10-fold cross-validation. A hyperparameter search was performed using the tree-structured Parzen estimator algorithm implemented in Optuna (Akiba et al., 2019) and AdaBelief (Zhuang et al., 2020) was used for optimization.

For the bagging method, an AI model was constructed using the following procedure (Fig. 1). The AI model consisted of nine sub-models. The final judgment result was determined using the majority vote of the sub-model predictions.

Fig. 1

Schema for Data Partitioning and AI Model Construction. Detailed information is provided in the methods section of the text.

1. Data partitioning: 100 entries each of positive and negative data were randomly sampled from the total data and assigned to the ‘external_test’ set. The remaining data were used for training the AI model.

2. Creation of sub-datasets: Nine sub-datasets were created for AI model construction. Positive datasets, which were fewer in number, were common among the nine sub-datasets. Negative data were assigned to each sub-dataset such that they were equal in number to the positive data. Negative data were sampled randomly to minimize data overlap among the sub-datasets.

3. Sub-dataset split: Each sub-dataset was split into training, validation, and internal test sets at a ratio of 8:1:1.

4. Sub-AI model training: The model was trained and its hyperparameters were tuned via a 10-fold cross-validation using a training set.

5. Validation of the sub-AI models: The discriminant performance of the sub-models was evaluated using a validation set.

6. Model selection: Using the F1 score on the validation set as the indicator, the 10 models with the highest F1 scores were selected for internal testing.

7. Internal testing: The 10 selected models were evaluated on the internal test set.

8. Determining the best sub-AI model: The model with the highest F1 score in internal testing was selected as the sub-model.

9. Final validation of the AI model: The discriminant performance of the AI model consisting of nine sub-models was evaluated using an external test set.
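The partitioning and voting steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the counts 796 and 5,070 follow from removing 100 compounds per class from the 896 positive and 5,170 negative compounds, the cyclic draw from a shuffled pool is one possible way to "minimize data overlap", the 8:1:1 split is not stratified, and all function names are hypothetical.

```python
import random
from collections import Counter

def make_sub_datasets(positives, negatives, n_sub=9, seed=0):
    """Build n_sub balanced sub-datasets: all positives plus as many negatives."""
    if len(positives) > len(negatives):
        raise ValueError("expected positives to be the minority class")
    rng = random.Random(seed)
    pool = list(negatives)
    rng.shuffle(pool)
    k = len(positives)
    subs, start = [], 0
    for _ in range(n_sub):
        # Walk the shuffled pool cyclically so a negative is reused only
        # after every negative has been drawn once.
        chosen = [pool[(start + j) % len(pool)] for j in range(k)]
        start = (start + k) % len(pool)
        subs.append((list(positives), chosen))
    return subs

def split_8_1_1(items, rng):
    """Random 8:1:1 split into training, validation, and internal test sets."""
    items = list(items)
    rng.shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

def majority_vote(votes):
    """Return 1 if more than half of the 0/1 sub-model votes are positive."""
    return int(2 * sum(votes) > len(votes))

# Placeholder identifiers standing in for compounds.
positives = [("pos", i) for i in range(796)]
negatives = [("neg", i) for i in range(5070)]

subs = make_sub_datasets(positives, negatives)
usage = Counter(n for _, negs in subs for n in negs)
train, val, test = split_8_1_1(subs[0][0] + subs[0][1], random.Random(1))
print(len(subs), len(train), len(val), len(test), max(usage.values()))
print(majority_vote([1] * 8 + [0]))
```

Because nine sub-datasets of 796 negatives require 7,164 draws from 5,070 available negatives, some overlap is unavoidable; the cyclic draw keeps any single negative compound in at most two sub-datasets.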

Evaluation of predictive performance

The following metrics were used in the present study: F1 score, precision, recall, accuracy, the area under the receiver operating characteristic curve (ROC-AUC), and the area under the precision-recall curve (PR-AUC).

The ROC-AUC summarizes the trade-off between the true positive rate (recall) and the false positive rate, and the PR-AUC summarizes the trade-off between precision and recall.
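The threshold-based metrics can be computed from a confusion matrix, as in the sketch below. The confusion-matrix counts are not reported in the paper; they are back-calculated here, as an assumption, from the external-test precision and recall in Table 2 together with the 100 positive and 100 negative compounds of the external test set.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Inferred counts (assumption): recall 0.910 on 100 positives gives TP = 91,
# and precision 0.778 then implies FP = 26 among the 100 negatives.
bagging = classification_metrics(tp=91, fp=26, fn=9, tn=74)

# Inferred counts (assumption) for the model built without bagging:
# recall 0.580 gives TP = 58, and precision 0.906 implies FP = 6.
no_bagging = classification_metrics(tp=58, fp=6, fn=42, tn=94)

for name, m in (("bagging", bagging), ("no bagging", no_bagging)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

With these counts, the F1 score and accuracy reproduce the values reported in Table 2 for both models, which supports the reading that the external-test metrics were computed on the final majority-vote labels.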

RESULTS

Comparing compound chemical spaces for AI model construction and pharmaceuticals

We compared the chemical space of compounds in the dataset used for the AI model construction with that of pharmaceuticals currently circulated in the United States. The chemical spaces of compounds in the dataset and pharmaceuticals are shown in Fig. 2. The chemical space of the dataset used for AI model construction mostly encompassed the chemical space of pharmaceuticals.

Fig. 2

Comparison of Chemical Spaces of Compounds in the Study Dataset and Pharmaceuticals. Orange represents compounds positive for mitochondrial toxicity, blue indicates compounds negative for toxicity, and red indicates pharmaceuticals.

Constructing AI models

First, an AI model was constructed using a method that did not split the dataset. The F1 value for the external test was 0.707, which did not indicate adequate prediction performance (Table 2).

Table 2. Results of AI model prediction performance evaluation.

| Models | Data set | F1 | Precision | Recall | Accuracy | ROC-AUC | PR-AUC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Presented model* (bagging) | External test | 0.839 | 0.778 | 0.910 | 0.825 | 0.825 | 0.753 |
| | Internal test | 0.809 ± 0.021 | 0.740 ± 0.043 | 0.896 ± 0.044 | 0.788 ± 0.026 | 0.852 ± 0.026 | 0.824 ± 0.049 |
| | Training | 0.833 ± 0.015 | 0.784 ± 0.029 | 0.890 ± 0.036 | 0.823 ± 0.017 | 0.853 ± 0.025 | 0.830 ± 0.034 |
| Comparison (no bagging) | External test | 0.707 | 0.906 | 0.580 | 0.760 | 0.760 | 0.736 |

*Performance metrics for both the training set and internal evaluation were derived as averages of those of the sub-AI models. The performance metrics in the external evaluation reflected the aggregated performance of the AI model, which integrated nine sub-AI models.

Therefore, we constructed an AI model using the bagging method, in which the dataset was split into subsets. This AI model showed an F1 score of 0.839 in the external test, and all other evaluation indices also showed high values (Table 2).

Structural alert visualization

The newly developed AI model, constructed using kMoL, supports the IG method (Sundararajan et al., 2017), which can visualize substructures that contribute to toxicity. Examples are shown for fenofibrate (Fig. 3) and phenols (Fig. 4).
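The idea behind the IG method can be illustrated on a toy model. The sketch below is not the kMoL implementation: a three-feature logistic model stands in for the GNN, the weights and inputs are made up, and the path integral is approximated with a midpoint Riemann sum. It shows the two properties that make IG useful for structural alerts: features that do not differ from the baseline receive zero attribution, and the attributions sum to the difference between the prediction and the baseline prediction.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, bias):
    """Toy differentiable model: logistic regression output in (0, 1)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias)

def gradient(x, w, bias):
    """Analytic gradient of the model output with respect to each input."""
    p = model(x, w, bias)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, bias, steps=200):
    """Midpoint Riemann approximation of IG along the path baseline -> x."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b0 + alpha * (xi - b0) for xi, b0 in zip(x, baseline)]
        grads = gradient(point, w, bias)
        total = [t + g for t, g in zip(total, grads)]
    return [(xi - b0) * t / steps for xi, b0, t in zip(x, baseline, total)]

# Hypothetical weights and input; the all-zero baseline is an assumption.
w, bias = [2.0, -1.0, 0.5], -0.5
x, baseline = [1.0, 1.0, 0.0], [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, w, bias)
delta = model(x, w, bias) - model(baseline, w, bias)
print(attributions, sum(attributions), delta)
```

In a molecular setting, the inputs would be atom-level features, and the per-atom attributions are what the figures render as color intensity.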

Fig. 3

Structural Alerts for Fenofibrate Visualized using the Nine Sub-Models Composing the AI Model. In the structure, red indicates substructures contributing to a positive decision; the darker the color, the higher the contribution. Prediction 1.0 represents compounds positive for mitochondrial toxicity, and prediction 0.0 for those negative. The alert at the central bottom indicates a negative prediction result, whereas the results from the other models indicate positive predictions. The area framed in red indicates the bis-aryl ketone structure.

Fig. 4

Structural Alerts in Compounds with Phenol Motifs: a) 4-Nonylphenol, b) 4-tert-Butylphenol, c) 2,3,6-Trichlorophenol. These represent typical structural alerts predicted by the nine sub-models.

Fenofibrate induces mitochondrial toxicity via alterations in the mitochondrial electron transport chain (Brunmair et al., 2004). The bis-aryl ketone structure has been identified as a structural alert (Uda et al., 2020). Our model predicted fenofibrate as positive, with an 8:1 vote outcome from the sub-models. Figure 3 shows the visualization results for the nine sub-models. All eight sub-models that predicted a positive outcome indicated the bis-aryl ketone structure as the substructure contributing to the positive result. Conversely, the single sub-model that predicted a negative outcome did not identify the bis-aryl ketone structure as contributing to toxicity.

Phenol induces mitochondrial toxicity via uncoupling oxidative phosphorylation. The hydroxyl group attached to the benzene ring has been reported to be a structural alert (Naven et al., 2013). Figure 4 shows the visualization results for the phenolic compounds included in the external test set. The external test set contained three positive compounds with phenol motifs (4-nonylphenol, 4-tert-butylphenol, and 2,3,6-trichlorophenol). All three compounds were correctly predicted to be positive, with the hydroxyl group in the structure indicated as a substructure contributing to the positive result.

Application to elucidate DILI (drug-induced liver injury) mechanisms

Mitochondrial toxicity is a key mechanism underlying hepatotoxicity. Therefore, we explored the application of our AI model to elucidate the mechanisms of hepatotoxicity. Hepatotoxic compound information was obtained from DILIst (Thakkar et al., 2020). The external test set included 17 positive and six negative compounds listed in DILIst (Table 3). Our AI model correctly predicted the mitochondrial toxicity of 20 out of 23 compounds. The mitochondrial toxicities of 16 of 17 DILI-positive compounds were correctly predicted. Among these 16, 11 compounds were positive for both DILI and mitochondrial toxicity.

Table 3. Cross table for DILI and mitochondrial toxicity (Mitotox).

| | Mitotox-positive | Mitotox-negative |
| --- | --- | --- |
| DILI-positive | albendazole, amineptine, benzbromarone, bithionol, candesartan, fenofibrate, imipramine, methyltestosterone, nimesulide, pirarubicin, sofalcone, tolcapone | chlorpropamide, lamotrigine, methapyrilene, nicorandil, trazodone |
| DILI-negative | aripiprazole, calcitriol, vinblastine | methoxamine, salbutamol, triazolam |

DILI: compounds with positive and negative hepatotoxicity in DILIst.

Mitotox: Compounds with positive and negative mitochondrial toxicity in the dataset used to construct the AI model.

Bold letters indicate compounds for which the AI model correctly predicted the presence or absence of mitochondrial toxicity.

DISCUSSION

The AI model was constructed using two approaches: one without splitting the dataset and the other using the bagging method, which involved dividing the dataset into subsets. The model constructed without dataset splitting did not demonstrate a sufficiently high predictive performance (F1 score of 0.707). This modest predictive performance could be explained by a training data imbalance, wherein the negative data significantly outnumbered the positive data. Therefore, we utilized the bagging method with under-sampling, an ensemble learning method, to construct an AI model. This method is effective for creating models with imbalanced datasets because it can train an imbalanced dataset as if it were balanced; it also helps reduce the risk of overfitting. In this method, the final judgment is based on the majority vote of multiple constructed sub-models. We divided the dataset for AI model construction into nine subsets. The positive data, which were fewer in number, were shared across all nine subsets, whereas negative data were allocated to each subset to match the number of positive data. Thus, we constructed a mitochondrial toxicity prediction AI model with high discriminative performance, demonstrating an F1 score of 0.839. This outcome highlights the utility of the bagging method for constructing AI models with imbalanced datasets.

It should be noted that our model employs random seeds during the training data splitting process. This approach inherently introduces performance variation attributable to these seeds. Owing to high computational demand, it was not feasible to exhaustively test every scenario. As a result, we acknowledge the potential existence of undiscovered models with superior performance.

A notable aspect of our current model was its high recall coupled with lower precision. In the context of imbalanced data, post-under-sampling calibration in machine learning prediction has been proposed (Dal Pozzolo et al., 2015). This method typically yields conservative predictions for the minority class and more assertive ones for the majority class, often leading to lower probability estimations. However, upon applying calibration to our AI model, we observed an increase in precision (0.778 to 0.860), accompanied by a decrease in both recall (0.910 to 0.800) and F1 score (0.839 to 0.829). A decrease in recall values implies a higher likelihood of missing positive predictions. Considering the importance of not overlooking compounds with high toxicity potential during the early stages of drug discovery, a model with a higher recall value would be preferable when the F1 score remains largely unaffected. This led us to the decision against implementing this calibration method.
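The direction of this trade-off can be illustrated with the correction formulated by Dal Pozzolo et al. (2015), which maps a probability p_s obtained after under-sampling to p = βp_s / (βp_s − p_s + 1), where β is the fraction of majority-class samples retained. The sketch below is an assumption-laden illustration: the paper does not state the exact form or β the authors applied, and β = 796/5070 is derived here from the sub-dataset sizes.

```python
def calibrate(p_s, beta):
    """Correct a probability from a model trained on under-sampled data.

    beta is the fraction of negative (majority) samples kept by sampling.
    """
    return beta * p_s / (beta * p_s - p_s + 1.0)

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Assumed sampling fraction: 796 negatives kept out of 5,070 per sub-dataset.
beta = 796 / 5070

for p_s in (0.3, 0.5, 0.7, 0.9):
    print(p_s, round(calibrate(p_s, beta), 3))

# Reported precision/recall pairs before and after calibration.
f1_before = f1_score(0.778, 0.910)
f1_after = f1_score(0.860, 0.800)
print(round(f1_before, 3), round(f1_after, 3))
```

Because the corrected probability never exceeds the uncorrected one when β < 1, fewer compounds cross a fixed decision threshold after calibration, which is consistent with the higher precision and lower recall described above.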

Table 4 presents the mitochondrial toxicity prediction models developed using machine learning techniques and their predictive performance outcomes to date. The accuracies of these models varied between 0.771 and 0.895, and their F1 scores ranged from 0.591 to 0.855. Our AI model achieved an accuracy of 0.825 and an F1 score of 0.839 in external evaluations. Two previously developed models (Zhao et al., 2021; Jaganathan et al., 2022) demonstrated high predictive capabilities, with our AI model exhibiting similar performance. However, our model uniquely features the ability to visualize toxicity predictions and substructures that contribute to positive results. For instance, it successfully predicted structural alerts previously reported for fenofibrate and phenols (Figs. 3 and 4). This capability was absent in models from previous studies, indicating a significant advantage of our AI model. Therefore, our AI model is the first mitochondrial toxicity prediction model that combines high predictive accuracy with influential substructure visualization capabilities. This feature positions our AI model to aid in selecting candidate drug compounds and in SAR studies to mitigate mitochondrial toxicity.

Table 4. Predictive performance of mitochondrial toxicity prediction models developed in previous research.

| Year | Author | Method with best performance | Accuracy metrics | Best score | F1 score* |
| --- | --- | --- | --- | --- | --- |
| 2009 | Zhang H., et al. | SVM | Accuracy | 0.771 | 0.765 |
| 2020 | Hemmerich J., et al. | Deep learning | Balanced accuracy | 0.895 | 0.591 |
| 2020 | Tang W., et al. | Random forest | Balanced accuracy | 0.883 | n/a |
| 2021 | Zhao P., et al. | Random forest | Accuracy | 0.853 | 0.831 |
| 2021 | Bringezu F., et al. | XGBoost | Balanced accuracy | 0.800 | n/a |
| 2022 | Jaganathan K., et al. | CatBoost | Accuracy | 0.871 | 0.855 |

*Scores are either directly reported or derived from reported values.

We aimed to construct an AI model that could be used for drug safety evaluation. Therefore, the predictive capability of our AI model had to encompass the chemical space relevant to pharmaceuticals. A comparative analysis revealed that the chemical space of our dataset used to construct the AI model generally covered the pharmaceutical chemical space. However, the chemical space of pharmaceuticals is expanding continuously, necessitating model updates using new datasets. In addition, our AI models may not exhibit optimal prediction performance when applied to detailed SAR studies on specific chemical structures. In such cases, improving the predictive accuracy of the model may require additional training using datasets containing similar compounds.

Mitochondrial toxicity is a primary mechanism underlying hepatotoxicity. The liver, which is the energy production and storage center, is rich in mitochondria. Consequently, compounds exhibiting significant mitochondrial toxicity are associated with a high risk of liver damage. Numerous instances of hepatotoxicity have been linked to mitochondrial dysfunction (McGill et al., 2012; Spaniol et al., 2001; Tong et al., 2005; Chariot et al., 1999). Our AI model could accurately predict the mitochondrial toxicity of compounds known to induce hepatotoxicity. This suggests that our AI model is valuable for elucidating mechanisms underlying hepatotoxicity.

We developed an AI model for predicting mitochondrial toxicity that demonstrated high predictive accuracy by integrating a GNN with the bagging method. Mitochondrial toxicity is a key screening parameter in the drug discovery phase. Thus, our AI model is expected to play a significant role in selecting and prioritizing drug candidates. The AI model is also distinguished by its ability to visualize structural alerts in addition to predicting mitochondrial toxicity. This functionality could be invaluable for altering chemical structures to mitigate the risk of mitochondrial toxicity.

ACKNOWLEDGMENT

This study was supported by the Japan Agency for Medical Research and Development (AMED) under Grant Number JP23nk0101111.

Conflict of interest

The authors declare that there is no conflict of interest.

© 2024 The Japanese Society of Toxicology